An Efficient Algorithm for Mining Frequent Sequence with Constraint Programming
The main advantage of Constraint Programming (CP) approaches for sequential
pattern mining (SPM) is their modularity, which includes the ability to add new
constraints (regular expressions, length restrictions, etc). The current best
CP approach for SPM uses a global constraint (module) that computes the
projected database and enforces the minimum frequency; it does this with a
filtering algorithm similar to the PrefixSpan method. However, the resulting
system is not as scalable as some of the most advanced mining systems like
Zaki's cSPADE. We show how, using techniques from both data mining and CP, one
can use a generic constraint solver and yet outperform existing specialized
systems. This is mainly due to two improvements in the module that computes the
projected frequencies: first, computing the projected database can be sped up
by pre-computing the positions at which a symbol can become unsupported by a
sequence, thereby avoiding a scan of the full sequence each time; and second, by
taking inspiration from the trailing used in CP solvers to devise a
backtracking-aware data structure that allows fast incremental storing and
restoring of the projected database. Detailed experiments show how this
approach outperforms existing CP as well as specialized systems for SPM, and
that the gain in efficiency translates directly into increased efficiency for
other settings such as mining with regular expressions.
Comment: frequent sequence mining, constraint programming
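The two improvements described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `last_positions`, `extension_counts`, and `TrailedProjection` are hypothetical, and the sketch checks every symbol of a precomputed table rather than using whatever ordering the actual filtering algorithm relies on. The first part precomputes the last occurrence of each symbol so that support can be tested without rescanning a sequence; the second part keeps the projected database in one growing list with a stack of frames, so that backtracking only truncates the list.

```python
from collections import Counter


def last_positions(sequence):
    """Map each symbol to the index of its last occurrence in the sequence."""
    return {symbol: index for index, symbol in enumerate(sequence)}


def extension_counts(last_tables, projection):
    """Count, per symbol, the sequences that still contain it at or after
    their projection point, using only the precomputed tables."""
    counts = Counter()
    for sid, start in projection:
        for symbol, last in last_tables[sid].items():
            if last >= start:  # symbol is still supported by this sequence
                counts[symbol] += 1
    return counts


class TrailedProjection:
    """Projected database kept in one growing list plus a stack of frames
    (an assumed, simplified stand-in for a solver's trail)."""

    def __init__(self, n_sequences):
        # Initially every sequence is projected at position 0.
        self.entries = [(sid, 0) for sid in range(n_sequences)]
        self.frames = [(0, n_sequences)]

    def current(self):
        """Return the (sequence id, start position) pairs of the top frame."""
        start, size = self.frames[-1]
        return self.entries[start:start + size]

    def push(self, new_entries):
        """Store a deeper projection without copying the earlier ones."""
        start = len(self.entries)
        self.entries.extend(new_entries)
        self.frames.append((start, len(new_entries)))

    def pop(self):
        """Restore the previous projection on backtrack."""
        start, _ = self.frames.pop()
        del self.entries[start:]
```

In this sketch, restoring the parent projection costs only a list truncation, which is the property the abstract attributes to the backtracking-aware structure.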
Constraint-based sequence mining using constraint programming
The goal of constraint-based sequence mining is to find sequences of symbols
that are included in a large number of input sequences and that satisfy some
constraints specified by the user. Many constraints have been proposed in the
literature, but a general framework is still missing. We investigate the use of
constraint programming as a general framework for this task. We first identify
four categories of constraints that are applicable to sequence mining. We then
propose two constraint programming formulations. The first formulation
introduces a new global constraint called exists-embedding. This formulation is
the most efficient but does not support one type of constraint. To support such
constraints, we develop a second formulation that is more general but incurs
more overhead. Both formulations can use the projected database technique used
in specialised algorithms. Experiments demonstrate the flexibility towards
constraint-based settings and compare the approach to existing methods.
Comment: In Integration of AI and OR Techniques in Constraint Programming
(CPAIOR), 201
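The inclusion relation behind this task can be sketched in a few lines of Python. This is only the underlying embedding check, not the exists-embedding global constraint or its propagation; the function names and the particular pair of user constraints (minimum frequency, maximum length) are assumptions chosen for illustration.

```python
def has_embedding(pattern, sequence):
    """Return True if the symbols of `pattern` occur in `sequence` in order,
    not necessarily contiguously."""
    remaining = iter(sequence)
    # `symbol in remaining` consumes the iterator up to the first match,
    # so each symbol must be found after the previous one.
    return all(symbol in remaining for symbol in pattern)


def frequency(pattern, database):
    """Number of input sequences that include the pattern."""
    return sum(1 for sequence in database if has_embedding(pattern, sequence))


def satisfies(pattern, database, min_frequency, max_length):
    """Check two example user constraints (both hypothetical choices)."""
    return (len(pattern) <= max_length
            and frequency(pattern, database) >= min_frequency)
```

For example, with the sequences `"abcab"`, `"bca"`, and `"acb"`, the pattern `"ab"` is embedded in the first and third sequences but not in the second.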
Flexible constrained sampling with guarantees for pattern mining
Pattern sampling has been proposed as a potential solution to the infamous
pattern explosion. Instead of enumerating all patterns that satisfy the
constraints, individual patterns are sampled proportional to a given quality
measure. Several sampling algorithms have been proposed, but each of them has
its limitations when it comes to 1) flexibility in terms of quality measures
and constraints that can be used, and/or 2) guarantees with respect to sampling
accuracy. We therefore present Flexics, the first flexible pattern sampler that
supports a broad class of quality measures and constraints, while providing
strong guarantees regarding sampling accuracy. To achieve this, we leverage the
perspective on pattern mining as a constraint satisfaction problem and build
upon the latest advances in sampling solutions in SAT as well as existing
pattern mining algorithms. Furthermore, the proposed algorithm is applicable to
a variety of pattern languages, which allows us to introduce and tackle the
novel task of sampling sets of patterns. We introduce and empirically evaluate
two variants of Flexics: 1) a generic variant that addresses the well-known
itemset sampling task and the novel pattern set sampling task as well as a wide
range of expressive constraints within these tasks, and 2) a specialized
variant that exploits existing frequent itemset techniques to achieve
substantial speed-ups. Experiments show that Flexics is both accurate and
efficient, making it a useful tool for pattern-based data exploration.
Comment: Accepted for publication in Data Mining & Knowledge Discovery journal
(ECML/PKDD 2017 journal track)
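To make the sampling target concrete, the following Python sketch draws itemsets with probability proportional to a quality measure, here frequency. It is the naive enumerate-then-sample baseline, not the Flexics algorithm: it lists every pattern first, which is the cost that sampling is meant to avoid. All function names and the size limit are assumptions made for the example.

```python
import random
from itertools import combinations


def frequency(itemset, transactions):
    """Number of transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)


def candidate_patterns(transactions, min_frequency, max_size):
    """Enumerate itemsets satisfying the constraints, with their quality."""
    items = sorted(set().union(*transactions))
    patterns = {}
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            itemset = frozenset(combo)
            freq = frequency(itemset, transactions)
            if freq >= min_frequency:
                patterns[itemset] = freq
    return patterns


def sample_patterns(patterns, k, rng):
    """Draw k patterns, each with probability proportional to its quality."""
    population = list(patterns)
    weights = [patterns[p] for p in population]
    return rng.choices(population, weights=weights, k=k)
```

A call such as `sample_patterns(candidate_patterns(transactions, 2, 2), 5, random.Random(0))` returns five itemsets, with more frequent itemsets more likely to appear.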
Constraint Programming for Multi-criteria Conceptual Clustering
A conceptual clustering is a set of formal concepts (i.e., closed itemsets) that defines a partition of a set of transactions. Finding a conceptual clustering is an NP-complete problem for which Constraint Programming (CP) and Integer Linear Programming (ILP) approaches have been recently proposed. We introduce new CP models to solve this problem: a pure CP model that uses set constraints, and a hybrid model that uses a data mining tool to extract formal concepts in a preprocessing step and then uses CP to select a subset of formal concepts that defines a partition. We compare our new models with recent CP and ILP approaches on classical machine learning instances. We also introduce a new set of instances coming from a real application case, which aims at extracting setting concepts from an Enterprise Resource Planning (ERP) software. We consider two classic criteria to optimize, i.e., the frequency and the size. We show that these criteria lead to extreme solutions with either very few small formal concepts or many large formal concepts, and that compromise clusterings may be obtained by computing the Pareto front of non-dominated clusterings.
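The two-step structure of the hybrid model can be sketched in Python under strong simplifications. Brute-force enumeration stands in for both the data mining tool and the CP solver, so this only illustrates the definitions on tiny inputs. All function names are hypothetical, concepts with an empty intent are skipped, and reading the two criteria as the smallest extent and the smallest intent in a clustering is an assumption.

```python
from itertools import combinations


def extent(itemset, transactions):
    """Indices of the transactions that contain the itemset."""
    return frozenset(i for i, t in enumerate(transactions) if itemset <= t)


def intent(cover, transactions):
    """Items shared by all transactions in a non-empty cover (the closure)."""
    return frozenset.intersection(*(transactions[i] for i in cover))


def formal_concepts(transactions):
    """Step 1: enumerate (extent, intent) pairs of closed itemsets."""
    items = sorted(set().union(*transactions))
    concepts = set()
    for size in range(1, len(items) + 1):
        for combo in combinations(items, size):
            cover = extent(frozenset(combo), transactions)
            if cover:
                concepts.add((cover, intent(cover, transactions)))
    return concepts


def clusterings(concepts, n_transactions, k):
    """Step 2: select k concepts whose extents partition the transactions."""
    everything = frozenset(range(n_transactions))
    ordered = sorted(concepts, key=lambda c: sorted(c[0]))
    found = []
    for group in combinations(ordered, k):
        covers = [cover for cover, _ in group]
        disjoint = sum(len(c) for c in covers) == n_transactions
        if disjoint and frozenset().union(*covers) == everything:
            found.append(group)
    return found


def criteria(clustering):
    """Assumed reading of the two criteria: minimum frequency, minimum size."""
    return (min(len(cover) for cover, _ in clustering),
            min(len(items) for _, items in clustering))
```

Comparing the `criteria` tuples of all clusterings found for several values of `k` would be one way to approximate the Pareto front mentioned in the abstract.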
Prefix-Projection Global Constraint for Sequential Pattern Mining
Sequential pattern mining under constraints is a challenging data mining
task. Many efficient ad hoc methods have been developed for mining sequential
patterns, but they all suffer from a lack of genericity. Recent works
have investigated Constraint Programming (CP) methods, but they are still not
effective because of their encoding. In this paper, we propose a global
constraint based on the projected databases principle which remedies this
drawback. Experiments show that our approach clearly outperforms CP approaches
and competes well with ad hoc methods on large datasets.
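The projected databases principle that the global constraint builds on can be sketched as a plain recursive miner in Python. This is a PrefixSpan-style illustration for sequences of single symbols, not the proposed global constraint; the function name `prefix_span` and the representation of a projection as (sequence id, start position) pairs are assumptions.

```python
from collections import defaultdict


def prefix_span(database, min_support):
    """Return (pattern, support) pairs for all frequent sequential patterns."""
    results = []

    def mine(prefix, projected):
        # For each symbol, collect where each supporting sequence continues
        # after the first occurrence of that symbol in its projected suffix.
        extensions = defaultdict(list)
        for sid, start in projected:
            seen = set()
            for pos in range(start, len(database[sid])):
                symbol = database[sid][pos]
                if symbol not in seen:
                    seen.add(symbol)
                    extensions[symbol].append((sid, pos + 1))
        for symbol in sorted(extensions):
            entries = extensions[symbol]
            if len(entries) >= min_support:
                pattern = prefix + [symbol]
                results.append((pattern, len(entries)))
                # Recurse on the smaller projected database only.
                mine(pattern, entries)

    mine([], [(sid, 0) for sid in range(len(database))])
    return results
```

Each recursive call only looks at suffixes of sequences that already contain the current prefix, which is what keeps the search from rescanning the whole database.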
Competition and facilitation between the marine nitrogen-fixing cyanobacterium Cyanothece and its associated bacterial community
N2-fixing cyanobacteria represent a major source of new nitrogen and carbon for marine microbial communities, but little is known about their ecological interactions with associated microbiota. In this study we investigated the interactions between the unicellular N2-fixing cyanobacterium Cyanothece sp. Miami BG043511 and its associated free-living chemotrophic bacteria at different concentrations of nitrate and dissolved organic carbon and different temperatures. High temperature strongly stimulated the growth of Cyanothece, but had less effect on the growth and community composition of the chemotrophic bacteria. Conversely, nitrate and carbon addition did not significantly increase the abundance of Cyanothece, but strongly affected the abundance and species composition of the associated chemotrophic bacteria. In nitrate-free medium the associated bacterial community was co-dominated by the putative diazotroph Mesorhizobium and the putative aerobic anoxygenic phototroph Erythrobacter and after addition of organic carbon also by the Flavobacterium Muricauda. Addition of nitrate shifted the composition toward co-dominance by Erythrobacter and the Gammaproteobacterium Marinobacter. Our results indicate that Cyanothece modified the species composition of its associated bacteria through a combination of competition and facilitation. Furthermore, within the bacterial community, niche differentiation appeared to play an important role, contributing to the coexistence of a variety of different functional groups. An important implication of these findings is that changes in nitrogen and carbon availability due to, e.g., eutrophication and climate change are likely to have a major impact on the species composition of the bacterial community associated with N2-fixing cyanobacteria
An index to quantify an individual's scientific research output that takes into account the effect of multiple coauthorship
I propose the index ħ ("hbar"), defined as the number of papers of an
individual that have citation count larger than or equal to the ħ of all
coauthors of each paper, as a useful index to characterize the scientific
output of a researcher that takes into account the effect of multiple
coauthorship. The bar is higher for ħ.
Comment: A few minor changes from v1. To be published in Scientometrics
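A simplified reading of this definition can be sketched in Python. The sketch assumes that the ħ values of the coauthors are already known and fixed, whereas the definition is self-referential across authors, so this is an illustration rather than the full procedure. The function name `hbar_index` and the input layout are hypothetical.

```python
def hbar_index(papers):
    """Largest n such that at least n papers have a citation count of at
    least n and at least the (assumed known) hbar of every coauthor.

    `papers` is a list of (citations, coauthor_hbars) pairs.
    """
    best = 0
    for n in range(1, len(papers) + 1):
        qualifying = sum(
            1 for citations, coauthor_hbars in papers
            if citations >= max([n, *coauthor_hbars])
        )
        if qualifying >= n:
            best = n
    return best
```

With empty coauthor lists the function reduces to the ordinary h-index, so a paper written with a coauthor whose ħ exceeds its citation count lowers the result relative to h.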